convert.py: BPE fixes? #2938
Conversation
@klosax since you're the one that suggested testing with that Aquila model: #2842 (comment)
These changes should be OK to merge.
We cannot expect the current built-in BPE tokenizer to work with unicode chars. I've removed this support when porting the reference implementation from: https://github.com/cmp-nct/ggllm.cpp
It's just too much code for my taste.
If we cannot figure out an elegant way to support the BPE tokenizer (i.e. without 1000s of lines of unicode tables and handling), we probably have to drop support for it and require the user code to implement the tokenization. We can potentially expose the vocab through the API if necessary.
Thanks for the review. Let me ask this: is there a model that `convert.py` can handle in BPE mode which is expected to work for running inference in `llama.cpp`? In other words, is there currently a practical use case for the BPE vocab mode? I don't know too much about which LLaMAs use BPE, which should work, etc. With a little more information I'd be happy to add something like what I mentioned to `convert.py`.
I don't think so. We support LLaMA models that use the original SPM tokenizer. All other cases are experimental.
Inference works, it's just the tokenization that does not work.
Falcon models aren't converted using `convert.py`, though. Maybe there should be something to handle that case?

edit: If you don't want to make further changes like that, please feel free to merge this. I think it's better than the existing behavior at least.
Ok, let's merge for now. I'm not very sure what to do about Falcon - will figure it out as we go
@KerfuffleV2 : please check if #3252 improves the situation. And yes, I'm still converting |
Trying to get https://huggingface.co/BAAI/Aquila-7B working but it... doesn't feel very close.

- The BPE vocab handling adds `added_tokens` as float: this fails in gguf since arrays have to be homogeneous. Adding it as the wrong type might have caused issues too. Fixed that.
- Not every model has an `added_tokens.json`. I added some logic to fall back to trying to find the added tokens in `tokenizer.json` if it's available (see the sketch after this list).
- For `tokenizer.json` (and the Aquila model I mentioned) the added tokens aren't necessarily unique with the main vocab and have to be filtered out if they're already in the vocab. Maybe `added_tokens.json` doesn't need this logic?
- Added tokens were written with type `USER_DEFINED`. However, the token id to text rendering stuff in `llama.cpp` has no case for `USER_DEFINED`, so it would return an empty string 100% of the time. Fixed to add tokens with `NORMAL` type.
- Scores are set to `0.0` like the Falcon converter. I don't think it makes a difference what they're set to.
- Some models use `<0xblah>` tokens for bytes, same as LLaMA - for example https://huggingface.co/kfkas/Llama-2-ko-7b-Chat
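Here's a rough sketch of the added-token handling described above. It's illustrative only: the key names for `tokenizer.json` are assumed from the HF tokenizers format, and the real `convert.py` code is structured differently.

```python
import json
from pathlib import Path

def load_added_tokens(model_dir: Path, base_vocab: dict) -> dict:
    """Return {token_text: token_id} for added tokens, preferring
    added_tokens.json and falling back to the `added_tokens` list in
    tokenizer.json. Tokens already present in the base vocab are dropped,
    and ids are coerced to int so the gguf arrays stay homogeneous."""
    added = {}
    added_path = model_dir / "added_tokens.json"
    tok_path = model_dir / "tokenizer.json"
    if added_path.exists():
        added = json.loads(added_path.read_text(encoding="utf-8"))
    elif tok_path.exists():
        tok = json.loads(tok_path.read_text(encoding="utf-8"))
        added = {item["content"]: item["id"] for item in tok.get("added_tokens", [])}
    return {text: int(tok_id) for text, tok_id in added.items() if text not in base_vocab}
```

Each surviving token then gets written with a `0.0` score and `NORMAL` token type, per the list above.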
You won't get anywhere without #2889.
That's apparently not enough for the Aquila model to do anything except die. Apparently it has no newline token? We just expect it to exist and blindly look it up:
(llama.cpp/llama.cpp, lines 1752 to 1757 at e8422de)
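For context, that spot in `llama.cpp` assumes a newline token always exists in the vocab. A quick way to sanity-check a tokenizer from the Python side could look like this; it's a rough sketch, and the `model.vocab` layout plus the candidate token spellings are assumptions rather than anything `convert.py` does today:

```python
import json

def has_newline_token(tokenizer_json_path: str) -> bool:
    """Rough check: does the HF tokenizer vocab contain anything that could
    encode a newline? Checks a raw newline, the GPT-2 byte-level spelling
    'Ċ' (byte 0x0A maps to U+010A), and the SentencePiece-style '<0x0A>'."""
    with open(tokenizer_json_path, encoding="utf-8") as fp:
        vocab = json.load(fp)["model"]["vocab"]
    return any(tok in vocab for tok in ("\n", "Ċ", "<0x0A>"))

print(has_newline_token("tokenizer.json"))
```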
Another issue with trying to get the Aquila model to do something is the JSON files are not UTF-8, they're GBK apparently. On Linux at least you can convert them with something like `iconv -f gbk -t utf8 < vocab.json > vocab.json.tmp` and then rename over the original file. `tokenizer.json` also needs this treatment.
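If `iconv` isn't available (on Windows, say), the same re-encoding can be done in a few lines of Python. The file names below are just the two mentioned here; adjust as needed.

```python
from pathlib import Path

# Re-encode GBK-encoded JSON files to UTF-8 in place, keeping a .bak copy first.
for name in ("vocab.json", "tokenizer.json"):
    p = Path(name)
    raw = p.read_bytes()
    Path(name + ".bak").write_bytes(raw)  # back up the original
    p.write_text(raw.decode("gbk"), encoding="utf-8")
```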
After all that, it sort of works for English text at least. Except it doesn't have ` ` (space) in its vocabulary so tokenizing and all output is goingtobeallruntogethergoodtimes. It can't seem to tokenize Chinese characters at all. I also tried without the GBK to UTF-8 conversion and pasting in the wonky GBK stuff from the original files, but that didn't seem to work either (but I don't know that it would paste correctly since stuff on my system is set to use UTF-8).
As for the other Korean model, one problem is it has no `vocab.json`, which `convert.py`'s BPE mode requires, but that's pretty easy to extract from `tokenizer.json` (see the sketch below). It actually tokenizes properly (I think?) but only produces garbage output whether you speak Korean or English to it:

Trying Korean

Trying English
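For reference, extracting a usable `vocab.json` from `tokenizer.json` can be as simple as the following. It assumes the standard HF tokenizers layout with the BPE vocab under `model.vocab`, which may not hold for every model.

```python
import json

# Pull the BPE vocab out of tokenizer.json and write it as the vocab.json
# that convert.py's BPE mode expects to find next to the model files.
with open("tokenizer.json", encoding="utf-8") as fp:
    vocab = json.load(fp)["model"]["vocab"]

with open("vocab.json", "w", encoding="utf-8") as fp:
    json.dump(vocab, fp, ensure_ascii=False)
```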
This could be merged but hopefully we can talk about the existing problems and maybe figure out a way to fix more stuff first.